An introduction to plotting with ggplot2

With real biological data

Author

Dr. Emily Johnson

Why use ggplot2?

In this exercise, you’ll use data based on a real study rather than one of R’s inbuilt datasets. This gives you experience working with real biological data and shows you how to process such data to create your own visualisations using ggplot2(Wickham 2009).

The data comes from a study that identified transcriptomic signatures of tuberculosis (TB) infection(Singhania et al. 2018). The researchers used RNAseq to profile transcriptomes from patients with active TB, latent TB, those who later developed TB (TB progressors), and control patients.

For this practical, you’ll compare these groups and produce volcano plots and box plots/violin plots to summarise and interpret the results.

What you will need for this practical

  1. To have R and Posit RStudio ready. We will work in RStudio.

  2. To create a directory for this practical and save it somewhere easy to locate. You should call it ‘data_visualisation’. We suggest that you create this directory inside your Documents folder or on your Desktop.

  3. To Download the Practical materials and save them inside a folder called ‘data’, inside your ‘data_visualisation’ directory. There are multiple csv files: ‘active_TB_vs_control_results.csv’, ‘TB_progressor_vs_control_results.csv’, ‘normalised_TB_data.csv’, ‘TB_metadata.csv’, within the Practical page in Canvas. You should make sure these are all inside the ‘data’ folder.

  4. To change your working directory to the ‘data_visualisation’ folder.

  5. To open a new .R script and save it in the ‘data_visualisation’ folder. You will continue working on this script as you proceed through the practical, using the example code we provide. I’d strongly recommend you type this example code rather than copying and pasting. You will make more mistakes but those mistakes will aid in your learning and retention!

Worked examples

Volcano plot

A volcano plot is a specific type of scatter plot that shows statistical significance (-log10 adjusted p-values) versus magnitude of change (log fold change). They are common in –omics analyses, where they help identify changes between two groups of subjects, e.g., treatment vs control. Each point represents a gene, protein, metabolite or similar.

We’ll start by reading in data (comparing patients with active TB and controls) and loading the R packages we need.

# We may need to install tidyverse and ggrepel if they aren't already installed
# If so, run this code: 

install.packages("tidyverse")
install.packages("ggrepel")
# Load libraries
library(tidyverse) # Loads ggplot2, magrittr, and dplyr
library(ggrepel)   # For better text labels on plots
# Remember that we use the <- to assign values to variables 
# We use the # to add comments to our code, you should also use comments to help you remember why you did certain things

# Read in the data
activeTB_vs_control <- read.csv("data/active_TB_vs_control_results.csv", row.names=1)

Before creating any plots, let’s quickly inspect the data:

# View dimensions of the data
dim(activeTB_vs_control)
[1] 17786     6
# Look at the structure of the data
head(activeTB_vs_control, 10)
            logFC  AveExpr        t      P.Value    adj.P.Val        B
FCGR1A   3.301767 6.149391 12.02848 1.223836e-21 2.176714e-17 38.65068
SERPING1 3.825079 6.433310 11.62863 9.709342e-21 8.634518e-17 36.60604
UBE2L6   1.801724 8.197470 11.54765 1.478635e-20 8.766333e-17 36.21297
GBP4     2.025176 6.142948 11.33775 4.406000e-20 1.656056e-16 35.14543
GBP5     2.653025 8.099339 11.32718 4.655505e-20 1.656056e-16 35.09147
WARS     1.992432 8.349681 11.19540 9.250438e-20 2.742138e-16 34.41573
BATF2    3.936097 5.022033 11.08332 1.659741e-19 4.217165e-16 33.74777
GBP2     1.521199 9.124542 11.04558 2.021065e-19 4.493333e-16 33.64526
APOL1    1.949465 6.590448 10.97231 2.962881e-19 5.372448e-16 33.27924
SCO2     2.230601 7.101993 10.96861 3.020605e-19 5.372448e-16 33.26115


You should see 17,786 rows (genes) and six columns (variables). Here’s what each column represents:

Description of the different columns in the data. The six columns are a typical output of a differential expression analysis produced by software such as limma, DeSEQ2 or edgeR.
Column name Description
logFC The log fold-change column, it is an estimate of the log2-fold-change of the mean expression between the comparisons of interest
Avexpr The average expression of the gene across all samples (log2 scale)
t Moderated t-statistic i.e. the ratio of log2FC to its estimated standard error
P.value The raw p-values
adj.P.Val The p-values adjusted for multiple/testing false discovery rate (FDR). The default Benjamini-Hochberg method was used to adjust the p-values for FDR here.
B Log-odds that the gene is differentially expressed, assuming a certain fixed percentage of differentially expressed genes. Generally, the higher the value, the greater the chance the gene is differentially expressed
Note

Always use adjusted p-values to account for multiple testing.


Building the volcano plot step by step

Let’s start building our volcano plot using ggplot2. First, we’ll add a new column to our data, which transforms the adjusted p-values into the -log10 scale (this is for visualisation purposes):

# Add log10-adjusted p-value column
activeTB_vs_control$log10_p.adjust <- -log10(activeTB_vs_control$adj.P.Val)

# Confirm the new column is added
dim(activeTB_vs_control)
[1] 17786     7

Now that we are familiar with the data and have made the adjustments we need, we can begin to construct our plot using ggplot2. As mentionedt in the lecture, ggplot2 uses the ‘grammar of graphics’ (shown below).

The grammar of graphics. Each layer is added on top of eachother sequentially, starting from the data at the bottom to build a plot. Data, aesthetics and geometries are the three most essential grammatical elements and are required to produce a plot.


Let’s build the plot layer by layer:

Start with the data

First, let’s see what happens if we just provide the ggplot function with our data.

# Create a blank ggplot object with the data
p <- ggplot(data = activeTB_vs_control)

# Display the empty canvas
p

This results in just a blank canvas. All we have done is store our data inside our ggplot2 object, and provided no additional information on the aesthetics or geometries.

Add axis mappings

Next, let’s provide ggplot2 with our x and y axes using the aes function inside our ggplot function…

# Map the logFC to x and log10_p.adjust to y
p <- ggplot(data = activeTB_vs_control, aes(x=logFC, y=log10_p.adjust))

# Display the grid
p

Now, instead of a blank canvas we get a grid with our x and y axes shown.

Note

It is common for new ggplot2 users to miss off the closing brackets when providing the aesthetics inside the ggplot function, if you are getting errors make sure this is not the case!

Add points (geometries)

Next, we are going to tell ggplot which geometries to use. This is one of the aspects that makes ggplot so versatile but can also be one of the most difficult aspects to grasp.

Here we are telling ggplot2 whether to display our data as lines, bars, points, etc. We do so by adding a ‘+’ sign after our initial ggplot() call, which is adding our extra layer.

# Add points to the plot
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) + 
  geom_point()

p

Now, we have produced a (very basic!) volcano plot. Each point represents an individual gene plotted by their logFC and -log10 adjusted p-values. What observations can you make about this plot, do you notice any patterns? What happens if you provide the adjusted p-values (adj.P.Val) instead of the -log10 transformed p-values? Does this make any differences to the shape of the data?

So far our volcano plot is very plain, we are now going to improve it step by step, using other features of ggplot2. We can keep adding new layers by using the ‘+’ sign at the end of a line (never at the beginning). It is recommended to move onto a new line after each ‘+’ to make your code more readable.

Note

Before we proceed, it is worth stating that you do not need to manually specify data =, x= and y= each time you use ggplot2. As long as these arguments are in the same order ggplot2 will automatically recognise them.

# So this
p <- ggplot(data = activeTB_vs_control, aes(x=logFC, y=log10_p.adjust)) + 
  geom_point() 

p


# Can become this
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) + 
  geom_point() 

p

Improve information

Add significance thresholds

Currently, it is difficult to tell which genes are statistically significant looking at our plot. We are going to add dashed lines to our x- and y- axes to show which genes meet thresholds for significance.

p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) + # Data and mapping/aesthetics
  geom_point() + # Geometries, points
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + # Geometry, horizontal line
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") # Geometry, vertical line

p

geom_hline adds a horizontal line to our plot. We tell it to intercept the y-axis at -log10(0.05). This is because our threshold for significance is p-adjust < 0.05, and we need to -log10 transform this value to match the transformation we made to the data.

geom_vline adds vertical line(s) to our plot. Here you can see we told it to produce multiple lines by providing a vector of values c(-1, 1). You could add additional lines by including additional numbers inside the vector - try this yourself! Here -1 and 1 log2FC would correspond to the gene expression either halving (down-regulated) or doubling (up-regulated) relative to the control. From our new volcano plot we can see more genes are up-regulated in the active TB group compared to the control.

Add gene labels

It would be useful if we could see specific gene names on this plot. This will be the next layer we add. Here we will make use of the other library we loaded earlier, ggrepel. This allows us to use the geometry geom_text_repel, which is an improvement on ggplot2's native geom_label function. Essentially, this stops the text labels overlapping each other. In the code below we also set a ‘seed’, which essentially means that our labels will be in the same place every time, rather than positioned randomly - this helps with reproducibility with our figures.

Let’s label the top 10 most significant genes using geom_text_repel().

# Get top 10 most significant genes
top_genes <- activeTB_vs_control %>%
  rownames_to_column(var = "Genes") %>% # Add a new column to the dataframe containing the gene names
  arrange(adj.P.Val) %>% # Arrange in decending order of adjusted p-value
  head(10) # Take the top 10


# Set seed to ensure positioning of labels is the same every time
set.seed(42)

# Add gene labels to the plot
p <- ggplot(activeTB_vs_control, aes(logFC, log10_p.adjust)) + 
  geom_point() +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0)

p

We now have a volcano plot with significance thresholds and the most significant genes labelled. Can you change the significance thresholds on your plot and increase/decrease the number of gene names displayed?

Improve aesthetics

Our plot is now a lot more informative but it is still quite ugly! In this next section, we’ll improve the visual appearance by adjusting the axis labels, theme, and colours.

Axis-labels
p <- p + xlab(expression("log2FC")) +
    ylab(expression("-log10(FDR)"))

p

We can improve these even further by using mathematical notation. To do so, we make use of the expression function. The square brackets indicate that the number “2” should be written as a subscript. Everything in quotes is plain text. The “*” removes spacing between the elements.

p <- p + xlab(expression("log"[2]*"FC")) +
    ylab(expression("-log"[10]*"FDR"))

p

Theme

Next, we’ll add a theme to our plot. In general, the default ggplot2 theme is instantly recognizable and can make plots look unfinished/amateurish. Customizing the theme is one of the quickest and easiest ways to instantly improve a plot!

Below we apply ggplot2's ‘classic’ theme.

p <- p + theme_classic()

p

Exercise 1

This link contains a list of themes included with the ggplot2 package. Can you change the theme of your volcano plot to something different?

Colour points by significance

Now, we’ll add some visual interest and extra readability through use of some colour.

To do this, we’re going to add an additional column to our original data, stating whether a gene is significantly up-regulated or down-regulated. You don’t need to understand how the code works below but to summarise, we are making use of other functions from other tidyverse packages.

We take our original dataframe and pass it through a series of processing steps using the pipe function, %>%. The first step is to convert the row names to a new column called “Genes”. The second step is to add a new column using the mutate function by evaluating certain conditions. Here we create a new column called “Expression” which says a gene is “Up-regulated” when it has a logFC >=1 and adjusted p-value <= 0.05, a gene is “Down-regulated” when it has a logFC <= -1 and adjusted p-value <= 0.05, and “Unchanged” if it doesn’t meet either of those conditions.

We save all these changes to a new dataframe called “activeTB_vs_control_modified”.

activeTB_vs_control_modified <- activeTB_vs_control %>% # Pipe the dataframe through to the next function
    rownames_to_column(var = "Genes") %>% # Add the row names as a new column called gene
    mutate(
      Expression = case_when(logFC >= 1 & adj.P.Val <= 0.05 ~ "Up-regulated",
                             logFC <= -1 & adj.P.Val <= 0.05 ~ "Down-regulated",
                             TRUE ~ "Unchanged") # Add a new column by evaluating certain conditions
    )


head(activeTB_vs_control_modified)
     Genes    logFC  AveExpr        t      P.Value    adj.P.Val        B
1   FCGR1A 3.301767 6.149391 12.02848 1.223836e-21 2.176714e-17 38.65068
2 SERPING1 3.825079 6.433310 11.62863 9.709342e-21 8.634518e-17 36.60604
3   UBE2L6 1.801724 8.197470 11.54765 1.478635e-20 8.766333e-17 36.21297
4     GBP4 2.025176 6.142948 11.33775 4.406000e-20 1.656056e-16 35.14543
5     GBP5 2.653025 8.099339 11.32718 4.655505e-20 1.656056e-16 35.09147
6     WARS 1.992432 8.349681 11.19540 9.250438e-20 2.742138e-16 34.41573
  log10_p.adjust   Expression
1       16.66220 Up-regulated
2       16.06376 Up-regulated
3       16.05718 Up-regulated
4       15.78092 Up-regulated
5       15.78092 Up-regulated
6       15.56191 Up-regulated

We can add this new data to our ggplot2 object and then add a new layer with the points coloured by “Expression”.

# Add the new data to the ggplot2 object using '%+%'
p <- p %+% activeTB_vs_control_modified


# Add a new layer to the plot, a geometry, the points coloured by "Expression"
p <- p + geom_point(aes(color = Expression))

p

The colours used above are the default colours built into ggplot2. We can set our own custom colours by using R’s inbuilt colours or—if you’d prefer to pick your own—by using 6-digit hex codes. R’s named colours are below, let’s use some of those for now.

The 657 named colours in R. From Greg Gilbert lab.
p <- p + scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) 

p

Can you see any issues with this plot and the previous one…?

Answer
Reveal

The points are on top of our labels and dashed lines! This is because each new geometry layer added with ‘+’ is layered over the previous one. This means when we added a new layer of coloured points they covered the first layer of points, the dashed lines, and the labels.

To overcome this issue, we’ll create our plot again from the beginning, this time correctly ordering all the geometries.

p <- ggplot(activeTB_vs_control_modified, aes(logFC, log10_p.adjust)) + # New data
  geom_point(aes(color = Expression)) + # Geometries, points
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + 
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") + 
  geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0) + 
  scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
  xlab(expression("log"[2]*"FC")) +
  ylab(expression("-log"[10]*"FDR")) + 
  theme_classic()

p

Title

Now our plot is looking good! Finally, we’ll add a title.

p <- p + ggtitle("Volcano plot - active TB vs control")

p

Save it

After creating our volcano plot, we likely want to save it to our working directory. To do so we can use ggplot2’s inbuilt function ggsave. An example is below:

# Save plot to PDF
ggsave("volcano_plot_active_TB_vs_control.pdf", # File name we want to save our plot as
       p, # The ggplot2 plot object
       width = 5, # Width of plot
       height = 4) # Height of plot

You’ll note I saved this file as a .pdf instead of a .png (or other image type). This is because plots saved as .pdfs are vectorised, meaning you can scale them as large or small as you like and they won’t get blurry. It also means we can import them into software such as InkScape or Powerpoint to edit them manually if needed.

Altogether

We can bring all this code together and run it in one code chunk. I’ve made a few additional modifications to the code so that gene names for up- and down-regulated genes are both displayed. This is done during the second step where we determine the ‘top_genes’.

# Add columns for gene names and expression
activeTB_vs_control_modified <- activeTB_vs_control %>% 
    rownames_to_column(var = "Genes") %>% 
    mutate(
      Expression = case_when(logFC >= 1 & adj.P.Val <= 0.05 ~ "Up-regulated",
                             logFC <= -1 & adj.P.Val <= 0.05 ~ "Down-regulated",
                             TRUE ~ "Unchanged") 
    )



# Determine top differentially expressed genes
top_genes <- bind_rows(
      activeTB_vs_control_modified %>% 
        filter(Expression == 'Up-regulated') %>% 
        arrange(adj.P.Val, desc(abs(logFC))) %>% 
        head(5),
      activeTB_vs_control_modified %>% 
        filter(Expression == 'Down-regulated') %>% 
        arrange(adj.P.Val, desc(abs(logFC))) %>% 
        head(5)
    )


# Produce plot
p <- ggplot(activeTB_vs_control_modified, aes(logFC, log10_p.adjust)) + # Use the modified data
  geom_point(aes(color = Expression)) + 
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + 
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") + 
  geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0) + 
  scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
  xlab(expression("log"[2]*"FC")) +
  ylab(expression("-log"[10]*"FDR")) + 
  theme_classic() +
  ggtitle("Volcano plot - active TB vs control")


# Save plot
ggsave("volcano_plot_active_TB_vs_control.pdf", 
       p, 
       width = 5, 
       height = 4) 


The very final thing is to give our volcano plot a figure legend. It is important that the figure legend stands alone and contains all the information you need to interpret it. Below is our volcano plot, along with a figure legend typical of what you might see in a published paper.

Figure 1: Volcano plot depicting differential gene expression between active tuberculosis (TB) and control samples. The x-axis shows log2 fold change (log2FC) in gene expression, with upregulated genes in TB on the right (log2FC > 1) and downregulated genes on the left (log2FC < -1). The y-axis represents statistical significance (-log10 FDR), with the dashed horizontal line marking the threshold (FDR = 0.05). Genes above this line are significant, with colored points indicating upregulated (red), downregulated (blue), and non-significant (black) genes. The most significant up- and down- regulated genes are labelled.

So bringing this together, what does it all mean? To summarise, our volcano plot demonstrates that most genes are not differentially expressed, as indicated by the black points below the horizontal line and between the two vertical lines. This is typical of most -omics experiments and is what gives rise to the characteristic volcano shape. However, the plot also shows a substantial number of genes that are differentially expressed. Among these, many are highly significant, with adjusted p-values less than 0.01 or even 0.001. Overall, there are more up-regulated genes with large log2FC values. In a scientific paper, the next step would be to characterise these genes and put them in context, for example, by exploring whether genes from the same pathways are co-regulated. This figure is just the beginning of telling a story…

You now have the code to create and the know-how to interpret volcano plots of your own! Below are some exercises to put what you’ve just learned to use.

Exercise 2

In our previous volcano plot, we considered significantly upregulated genes as those having a logFC >= 1 and an adjusted p-value <= 0.05. Similarly, significantly downregulated genes were those with a logFC <= -1 and an adjusted p-value <= 0.05. However, in some studies, especially those involving human data, we may need more stringent thresholds. For example, we could set the p-value cut-off to 0.01 and focus only on genes with a logFC >= 2 or <= -2, corresponding to a 4-fold change in expression.

  1. Can you create a new volcano plot using these more stringent cut-offs?
  2. Can you use different colors to represent significantly upregulated and downregulated genes?
    • Earlier, we used R’s in-built named colours. You’re free to continue using these or you could try selecting your own colours by using hex-codes instead. To generate hex-codes you could try one of the two websites below. Both allow you to search colours and provide suggestions for other colours that complement them.
    • https://color.adobe.com/create/color-wheel
    • https://www.colorhexa.com/6738c9
Answer
# Modify the dataset by adding a new column for gene expression classification
activeTB_vs_control_modified <- activeTB_vs_control %>% 
  rownames_to_column(var = "Genes") %>% 
  mutate(
    Expression = case_when(
      logFC >= 2 & adj.P.Val <= 0.01 ~ "Up-regulated",
      logFC <= -2 & adj.P.Val <= 0.01 ~ "Down-regulated",
      TRUE ~ "Unchanged"
    )
  )

# Select the top 5 up- and down-regulated genes
top_genes <- bind_rows(
  activeTB_vs_control_modified %>% 
    filter(Expression == 'Up-regulated') %>% 
    arrange(adj.P.Val, desc(abs(logFC))) %>% 
    head(5),
  activeTB_vs_control_modified %>% 
    filter(Expression == 'Down-regulated') %>% 
    arrange(adj.P.Val, desc(abs(logFC))) %>% 
    head(5)
)

# Generate the volcano plot
# Using a mix of hex-codes and named colours for colours
p <- ggplot(activeTB_vs_control_modified, aes(logFC, log10_p.adjust)) + 
  geom_point(aes(color = Expression)) + 
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") + 
  geom_vline(xintercept = c(-2, 2), linetype = "dashed") + 
  geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0) + 
  scale_color_manual(values = c("#2f9fa1", "#454343", "orange1")) +
  xlab(expression("log"[2]*"FC")) +
  ylab(expression("-log"[10]*"FDR")) + 
  theme_classic() +
  ggtitle("Volcano Plot - Active TB vs Control")

p

Exercise 3

Inside the data folder, you’ll find another dataset named TB_progressor_vs_control_results.csv, which contains the results from patients who later developed TB compared to controls.

You can load the data with the following command:

TBprogressor_vs_control <- read.csv("data/TB_progressor_vs_control_results.csv", row.names=1)
  1. Can you create a volcano plot with this new data? You can choose which cut-offs and colors to use. Remember, you’ll need to follow the earlier pre-processing steps as well.
  2. What differences do you observe between this volcano plot and the one you created earlier? To make comparison easier, you could name this plot p2 instead of p.
Answer 1
# Load the data
TBprogressor_vs_control <- read.csv("data/TB_progressor_vs_control_results.csv", row.names=1)

# Add a column for -log10 transformed adjusted p-values
TBprogressor_vs_control$log10_p.adjust <- -log(TBprogressor_vs_control$adj.P.Val, 10)

# Modify the dataset to include gene expression classification
TBprogressor_vs_control_modified <- TBprogressor_vs_control %>% 
  rownames_to_column(var = "Genes") %>% 
  mutate(
    Expression = case_when(
      logFC >= 1 & adj.P.Val <= 0.05 ~ "Up-regulated",
      logFC <= -1 & adj.P.Val <= 0.05 ~ "Down-regulated",
      TRUE ~ "Unchanged"
    )
  )

# Select the top 5 up- and down-regulated genes
top_genes <- bind_rows(
  TBprogressor_vs_control_modified %>% 
    filter(Expression == 'Up-regulated') %>% 
    arrange(adj.P.Val, desc(abs(logFC))) %>% 
    head(5),
  TBprogressor_vs_control_modified %>% 
    filter(Expression == 'Down-regulated') %>% 
    arrange(adj.P.Val, desc(abs(logFC))) %>% 
    head(5)
)

# Generate the volcano plot
p2 <- ggplot(TBprogressor_vs_control_modified, aes(logFC, log10_p.adjust)) + 
  geom_point(aes(color = Expression)) + 
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + 
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") + 
  geom_text_repel(data = top_genes, aes(label=Genes), max.overlaps = Inf, min.segment.length = 0) + 
  scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
  xlab(expression("log"[2]*"FC")) +
  ylab(expression("-log"[10]*"FDR")) + 
  theme_classic() +
  ggtitle("Volcano Plot - TB Progressor vs Control")

p2

Answer 2

The active TB group have many more significantly differentially expressed genes compared to the control group. In addition, there are more upregulated genes in the active TB group, whereas the TB progressor group shows a more balanced distribution of upregulated and downregulated genes. Finally, the active TB group contains more highly significant genes, as shown by the higher values on the y-axis (up to 15) compared to the TB progressor group.

Exercise 4

Another common visualisation for differential expression analysis is the MA plot. An MA plot displays the relationship between the magnitude of change (M, log fold change) and the mean expression (A, average expression). In this plot, AveExpr (A) is plotted on the x-axis, and logFC (M) is plotted on the y-axis.

  1. Can you create an MA plot using this information?

    • Include a horizontal line at y = 0 to highlight genes with no fold change.
  2. (Advanced) Can you color the significant genes differently on the MA plot?

  3. (Optional) Can you write a figure legend for the MA plot in the same style as our volcano plot earlier? For this you could paste your MA plot into word and write the figure legend underneath.

Answer 1

Note: You don’t need to make the same visual/aesthetic choices for your MA plot, but this is an example. Additionally, the points have been made slightly transparent using the alpha=0.3 argument.

# Generate the MA plot
p <- ggplot(activeTB_vs_control_modified, aes(AveExpr, logFC)) + 
  geom_point(color="black", alpha=0.3) + 
  geom_hline(yintercept = 0, linetype = "dashed", color="firebrick", linewidth = 0.7) + 
  xlab(expression("Mean expression (A)")) +
  ylab(expression("log"[2]*"FC (M)")) + 
  theme_classic() +
  ggtitle("MA Plot")

p

Answer 2
# Add a new column to mark significant genes
activeTB_vs_control_modified <- activeTB_vs_control_modified %>% 
  mutate(
    Significant_genes = case_when(
      adj.P.Val <= 0.05 ~ "Significant genes",
      TRUE ~ "Non-significant genes"
    )
  )

# Generate the MA plot with color for significant genes
p <- ggplot(activeTB_vs_control_modified, aes(AveExpr, logFC)) + 
  geom_point(aes(color=Significant_genes), alpha=0.3) + 
  geom_hline(yintercept = 0, linetype = "dashed", color="firebrick", linewidth = 0.7) + 
  scale_color_manual(name = "Significance", values = c("black", "dodgerblue3")) +
  xlab(expression("Mean expression (A)")) +
  ylab(expression("log"[2]*"FC (M)")) + 
  theme_classic() +
  ggtitle("MA Plot")

p

Box plots and violin plots

Our volcano plots gave us an idea of which genes in our dataset might be significantly different between the active TB and control samples. We might next want to visualise the distribution of the samples in the two groups for our genes of interest. This is is a perfect use for a box plot (Krzywinski and Altman 2014).

For this exercise we will have process our data for it to be suitable for ggplot2.

First, we’ll read the data in.

sample_data <- read.csv("data/normalised_TB_data.csv", row.names=1)
metadata <- read.csv("data/TB_metadata.csv", row.names=1)

As previously we’ll do some quick inspections of the data.

dim(sample_data)
[1] 17786   175
dim(metadata)
[1] 175   4
sample_data[1:6, 1:4]
         Active.TB TB.progressor TB.progressor.1 TB.progressor.2
TSPAN6   -2.680353    -0.4008798       -1.019455     -1.80882670
DPM1      3.809980     3.4649190        3.837851      4.01469200
SCYL3     3.925248     3.8028715        3.645232      4.81295776
C1orf112  1.844374     1.3013206        1.423615      2.15749010
FGR      10.457478     9.5603012        9.667233     10.59497830
CFH       1.619406     0.9435486        0.976515     -0.05105541
head(metadata)
                        Group Age  Ethnicity Gender
Active TB           Active TB  29 South Asia   Male
TB progressor   TB progressor  23 South Asia Female
TB progressor.1 TB progressor  23 South Asia Female
TB progressor.2 TB progressor  23 South Asia Female
TB progressor.3 TB progressor  23 South Asia Female
Active TB.1         Active TB  25 South Asia   Male

We can see there are 17,786 rows and 175 columns in the data for our samples, corresponding to the 17,786 genes and 175 samples. The metadata has 175 rows (the samples) and 4 columns. For the metadata the one we are interested in is Group, telling us which TB group the patients belong to.

Single gene

Inspection of the data and our volcano plot shows that ‘FCGR1A’ is the most significant differentially expressed gene in the active TB vs control comparison. We’ll now create box plots to show the distribution of gene expression in the different groups.

Create data frame

We’ll create a dataframe of FCGR1A expression by combining the FCGR1A values from the sample data with the group information from our meta data.

FCGR1A_data <- data.frame(t(sample_data["FCGR1A",]), metadata$Group)
colnames(FCGR1A_data) <- c("Expression", "Group")
Produce box plot

To start, we’ll produce a simple box plot. Instead of using geom_point(), we use the geom_boxplot() function to indicate that we are using box plots to represent the data.

p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
  geom_boxplot()

p

Though this plot is simple, it tells us several things. Perhaps the most obvious is the line inside the boxes, which represents the median (50th percentile), showing the central value of the data in each group. The box itself represents the inter-quartile range, with the top of the boxes representing the 75th percentile and the bottom of the boxes representing the 25th percentile. The vertical lines—or whiskers, as they’re mostly commonly known—show the range of the data, typically excluding outliers. The points outside the whiskers are outliers.

In the context of our data, we can see the Active TB group has the highest median value for the FCGR1A gene, and the Control group has the lowest median value. The whiskers show that the TB progressor group has the widest range of data, we can also see that the Latent TB group has an outlier value.

Now, we are going to improve the appearance of our box plot and include additional information.

Exercise 1

Using the skills we gained in the previous section, can you make these box plots look nicer? You might consider:

  • The axis labels

  • The theme

  • The colour of the box plots (hint: you can set the colour with geom_boxplot() by using aes() inside the function).

Answer

Below is an example where we have changed the axis-labels, in this case getting rid of the redundant x-axis label and making our y-axis label more informative. We have added a theme and changed the colours of the box plots. Note that setting the colour changes the outline of our box plots rather than the fill. This is a separate argument for ggplot2, which we will go into more below.

p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
  geom_boxplot(aes(color=Group)) + 
  scale_color_manual(values=c("#455054", "#308695", "#D45769", "#E69D45")) +
  xlab("") +
  ylab(expression("Log"[2]*" FCGR1A expression")) + 
  theme_classic()

p

Fill the boxes

If we want to fill the inside of our box plots instead of changing the outline we need to use the ‘fill’ argument in our aes() call instead, and use the scale_fill_manual() function to set the colours. It is optional which you’d prefer to colour—if any!—and depends on your preferences.

p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
  geom_boxplot(aes(fill=Group)) + 
  scale_fill_manual(values=c("#455054", "#308695", "#D45769", "#E69D45")) +
  xlab("") +
  ylab(expression("Log"[2]*" FCGR1A expression")) + 
  theme_classic()

p

Sometimes it can be hard to manually chose our own colour palette, especially if you’re new to colour theory and don’t know which colours complement eachother. Luckily there are premade colour palettes in R (as part of the package RColorBrewer) that are available to you in ggplot2. To use a one of these premade colour palettes we use the scale_fill_brewer() function instead of scale_fill_manual(). Inside the function call you need to specify the name of the palette you want to use, in this case we’ll use “Dark2” but some other example palettes are below:

Brewer palettes from RColorBrewer package
p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
  geom_boxplot(aes(fill=Group)) + 
  scale_fill_brewer(palette="Dark2") + # Use the 'Dark2' premade colour palette
  xlab("") +
  ylab(expression("Log"[2]*" FCGR1A expression")) + 
  theme_classic()

p

Violin plot

Violin plots are closely related to box plots and often share the same purpose, however they can add additional useful information such as the distribution of the sample data (density trace). This is demonstrated in the figure below.

Seven distributions of data, shown as raw data points, box plots, and violin plots. Taken from ‘Same Stats, Different Graphs’, 2017.

We will produce a basic violin plot below:

p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
  geom_violin() 

p

Exercise 2
  1. Modify the violin plot code so it resembles the box plots above (theme, labels, colour/fill - you can use the manual colour palette or the premade brewer palette).

  2. What observations can you make about the four groups from the data?

Answer
p <- ggplot(FCGR1A_data, aes(Group, Expression)) +
  geom_violin(aes(fill=Group)) + 
  scale_fill_manual(values=c("#455054", "#308695", "#D45769", "#E69D45")) +
  xlab("") +
  ylab(expression("Log"[2]*" FCGR1A expression")) + 
  theme_classic()

p

The violin plots tell us multiple things. The distribution of the control samples is roughly normal, and the lowest of the four groups. The latent TB group has a similar distribution but some notable high outliers. The TB progressor group has the widest distribution, reflecting that some of these patients may be in the process of developing TB. The active TB group has the highest distribution of values, with very little overlap with the control samples. At a glance it is easy to see these two groups are very different!

Box and violin
Exercise 3

Before beginning the next section, please attempt to create a violin plot with a box plot overlaid on top. Don’t worry if you can’t manage this on your own, we will walk through the process below.

It is more typical for violin plots to have either the individual data points displayed on top or a box plot. To produce such a visualisation, we need to layer our ggplot2 object correctly. The code is below, though has been hidden to allow you to attempt the above exercise first.

Code
p <- ggplot(FCGR1A_data, aes(Group, Expression, fill=Group)) +
  geom_violin(color=NA, alpha=0.3, width=0.6) + 
  geom_boxplot(width=0.2) +
  scale_fill_manual(values=c("#455054", "#308695", "#D45769", "#E69D45")) +
  xlab("") +
  ylab(expression("Log"[2]*" FCGR1A expression")) + 
  theme_classic()

p

Improve it further!

At this point our visualisation looks quite nice, but there are a few additional steps we can take to make it publication ready.

  • Reorder groups

  • Intentional colours

  • Hide redundant legend

To reorder the groups in our plot, we need to convert the group column to an ordered factor. This tells R there is an order to our categorical data.

FCGR1A_data$Group <- factor(FCGR1A_data$Group,
                            ordered = TRUE,
                            levels = c("Control", "Latent TB", "TB progressor", "Active TB"))

Now, let’s produce the plot again.

p <- ggplot(FCGR1A_data, aes(Group, Expression, fill=Group)) +
  geom_violin(color=NA, alpha=0.3, width=0.6) + 
  geom_boxplot(width=0.2) +
  scale_fill_manual(values=c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +
  xlab("") +
  ylab(expression("Log"[2]*" FCGR1A expression")) + 
  theme_classic() + 
  theme(legend.position = "none") # Remove redundant legend 

p

We have reordered the variables so the ‘Control’ group is shown first, then the other groups in order of severity. Similarly, we have also employed a colour-gradient along this axis of severity. Finally, we have removed the redundant legend on the right by modifying the theme, as the grouping information is already included in the x-axis.

Multiple genes

Now that we’ve learned how to plot a single gene using box and violin plots, let’s move on to visualising multiple genes at once. For this exercise, we’ll use facet_wrap() to display multiple genes in a grid of plots. This allows us to see the expression of several genes of interest across the different groups.

Create a data frame for multiple genes

To get started, we need to extract the expression data for our genes of interest from the sample data and combine it with the group information from the metadata. We’ll do this for the genes: "FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", and "RP5-1184F4.7".

Then, we’ll need to reshape our data into a ‘tidy’ format so that all gene expression data is in a single column, with an additional column to indicate the gene name(Wickham 2016). This will prepare the data for ggplot2 to handle multiple facets.

Data is considered ‘tidy’ when:

  1. Each variable forms a column

  2. Each observation forms a row

  3. Each observational unit forms a table

Let’s first create a dataframe of the gene expression data that we want to convert to a ‘tidy’ format.

# Genes of interest
genes_of_interest <- c("FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", "RP5-1184F4.7")

# Extract the expression data for the genes of interest
multi_gene_data <- data.frame(t(sample_data[genes_of_interest, ]))

# Add group information
multi_gene_data$Group <- metadata$Group

# Inspect data structure
head(multi_gene_data)
                  FCGR1A  SERPING1    UBE2L6     GBP4   MAGEF1 RP5.1184F4.7
Active.TB       9.094789 10.028439 10.001784 7.813265 1.992708   -3.4549427
TB.progressor   3.375139  4.706795  7.225487 5.077932 2.850249    0.9744491
TB.progressor.1 5.639912  6.281472  7.883710 5.780228 2.655709    0.8747338
TB.progressor.2 9.538771 10.091418 10.475483 8.616317 1.900342    0.4368986
TB.progressor.3 6.524266  7.947446  8.976370 6.406779 2.474013    0.7824810
Active.TB.1     7.924561  7.958690  9.307978 7.194368 1.859675   -1.3883841
                        Group
Active.TB           Active TB
TB.progressor   TB progressor
TB.progressor.1 TB progressor
TB.progressor.2 TB progressor
TB.progressor.3 TB progressor
Active.TB.1         Active TB

Initially, each row represents a sample, and the “Group” column indicates which group (Control, Latent TB, etc.) that sample belongs to.

Now, we will convert the data to a ‘tidy’ format using tidyr‘s pivot_longer function. This ’lengthens’ the data, increasing the number of rows and decreasing the number of columns.

When we use the function below:

  • -Group tells R not to pivot the “Group” column - i.e. the grouping information stays as it is

  • The transformation is applied to all other columns

  • names_to = "Gene" specifies that the column names should be gathered into a new column called “Gene”.

  • values_to = "Expression means that the original gene expression columns are gathered into a new column called “Expression”

# Reshape data into tidy/long format for ggplot2
multi_gene_data <- multi_gene_data %>%
  pivot_longer(-Group, names_to = "Gene", values_to = "Expression")

# Convert group data to ordered factor
multi_gene_data$Group <- factor(multi_gene_data$Group,
                            ordered = TRUE,
                            levels = c("Control", "Latent TB", "TB progressor", "Active TB"))

head(multi_gene_data)
# A tibble: 6 × 3
  Group     Gene         Expression
  <ord>     <chr>             <dbl>
1 Active TB FCGR1A             9.09
2 Active TB SERPING1          10.0 
3 Active TB UBE2L6            10.0 
4 Active TB GBP4               7.81
5 Active TB MAGEF1             1.99
6 Active TB RP5.1184F4.7      -3.45

Comparing this dataframe to the one we first produced we can see that it now only has three columns: “Group”, “Gene”, and “Expression”, compared to our earlier dataframe that had separate columns for genes. This means our dataframe is now tidy, as all the information needed to facet the plot is contained in the “Gene” column.


Facet-wrapped plots

Now that our data is in the correct format, we can create a violin plot with box plots laid over the top for each gene. We’ll use facet_wrap() to create individual plots for each gene.

# Create the plot with facet_wrap
p <- ggplot(multi_gene_data, aes(Group, Expression, fill = Group)) +
  geom_violin(color = NA, alpha = 0.3, width = 0.6) + 
  geom_boxplot(width = 0.2) +
  facet_wrap(~ Gene, scales = "free_y") +  # Create a separate plot for each gene
  scale_fill_manual(values = c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +  
  xlab("") +
  ylab(expression("Log"[2]*" expression")) + 
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")  # Rotate x-axis labels for readability, remove redundant legend 

p

In this plot:

  • We use facet_wrap(~ Gene) to create separate plots for each gene of interest.

  • The scales = "free_y" option ensures that each gene has its own y-axis scale, which is important when different genes have different expression ranges.

  • We’ve used geom_violin() function and geom_boxplot() functions as previously.

  • We’ve rotated the x-axis labels to make the plot more readable.

Faceting in ggplot2 is a powerful tool that enables you to create complex plots efficiently. By facet-wrapping multiple genes, we can quickly visualise the expression patterns across the groups for several genes at the same time (for example, groups of genes in the same pathway). This approach also offers flexibility for customising the appearance and layout of your plots, making it straightforward to produce publication-ready visuals with minimal effort.

Exercise 4

Can you write a figure legend for the violin plots as you’d find in a published paper? You could paste the plot into Microsoft Word or a similar word processor and write the figure legend below.

Answer

Figure 2: Log₂ expression levels of differentially expressed genes across different tuberculosis stages. Violin plots display the distribution of expression levels for each group, with overlaid box plots showing the median, interquartile range, and spread of the data. Each panel represents a different gene, with y-axes scaled independently. Groups are coloured by tuberculosis progression with control patients in blue, latent and progressors in purple and active tuberculosis in red.

Exercise 5

Can you create a facet-wrapped grid of new genes of interest? These could be:

  • Genes that are up-regulated or genes that are down-regulated in active TB vs control

  • The most significant genes from the TB progressor vs control comparison

Answer
# Significant genes from the TB progressor vs control comparison
genes_of_interest <- c("UBE2L6", "SERPING1", "KRT73-AS1", "ADAMTSL4", "BATF2", "CNR2")

# Extract the expression data for the genes of interest
multi_gene_data <- data.frame(t(sample_data[genes_of_interest, ]))

# Add group information
multi_gene_data$Group <- metadata$Group

# Reshape data into tidy/long format for ggplot2
multi_gene_data <- multi_gene_data %>%
  pivot_longer(-Group, names_to = "Gene", values_to = "Expression")

# Convert group data to ordered factor
multi_gene_data$Group <- factor(multi_gene_data$Group,
                            ordered = TRUE,
                            levels = c("Control", "Latent TB", "TB progressor", "Active TB"))

# Create the plot with facet_wrap
p <- ggplot(multi_gene_data, aes(Group, Expression, fill = Group)) +
  geom_violin(color = NA, alpha = 0.3, width = 0.6) + 
  geom_boxplot(width = 0.2) +
  facet_wrap(~ Gene, scales = "free_y") +  
  scale_fill_manual(values = c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +  
  xlab("") +
  ylab(expression("Log"[2]*" expression")) + 
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")  

p

Exercise 6

(Advanced) Modify the existing violin + box plot so that individual data points are overlaid on the box plots and violins. You can use geom_jitter() or geom_point() to display the expression levels for each sample. Adjust the transparency (alpha) and size of the points to ensure the plot is readable.

Example:

geom_jitter(position = position_jitter(width = 0.2), alpha = 0.5, size = 0.8)
Answer
# With geom_point()
p <- ggplot(multi_gene_data, aes(Group, Expression, fill = Group)) +
  geom_violin(color = NA, alpha = 0.3, width = 0.6) + 
  geom_boxplot(width = 0.2) +
  geom_point(alpha = 0.5, size = 0.7) +  # Overlay data points
  facet_wrap(~ Gene, scales = "free_y") +  
  scale_fill_manual(values = c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +  
  xlab("") +
  ylab(expression("Log"[2]*" expression")) + 
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")  

p

# With geom_jitter()
p <- ggplot(multi_gene_data, aes(Group, Expression, fill = Group)) +
  geom_violin(color = NA, alpha = 0.3, width = 0.6) + 
  geom_boxplot(width = 0.2) +
  geom_jitter(position = position_jitter(width = 0.1), alpha = 0.5, size = 0.7) +  # Overlay data points
  facet_wrap(~ Gene, scales = "free_y") +  
  scale_fill_manual(values = c("#308695", "#a077a6", "#ab547c", "#ba2b40")) +  
  xlab("") +
  ylab(expression("Log"[2]*" expression")) + 
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")  

p

Concluding thoughts

At first, ggplot2 can feel overwhelming, especially if you aren’t used to coding. While the plots it produces are beautiful, you might be tempted to revert to familiar tools like Excel or GraphPad for simplicity.

However, imagine a situation where you were comparing 10 or 20 different conditions, where you would need to generate new plots for each condition manually. This can lead to mistakes, inconsistencies in labelling or sizing, and inefficiency. With ggplot2, you can reuse the same code to produce consistent and reproducible plots. As you become more comfortable with R, you can automate your workflow, using loops and functions to generate dozens or even hundreds of plots in a matter of seconds—saving time and ensuring accuracy.

This workshop has provided you with two worked examples of the types of plots you can produce using ggplot2, however, your research projects will be extremely varied and will have their own unique plotting requirements. There are a wealth of free materials, tutorials and videos online and many community forums you can turn to if you get stuck.


Extra/supplementary materials

Plotting outside of ggplot2 - heatmaps!

In this workshop we have learned how to use ggplot2 and how versatile it can be for a number of applications. However, there are a few types of visualisation where other, more specialised, plotting tools can be preferable, for example, heatmaps.

Heatmaps are another commonly used visualisation used to represent -omics data, particularly when dealing with multiple genes/proteins/etc and samples. Heatmaps are often used in bioinformatics to show the expression levels of genes across different samples in a matrix format, where colours represent the expression values. In the following supplementary exercise we are going to compare producing a heatmap with ggplot2 vs an alternative plotting tool: ComplexHeatmap, which offers more functionality for heatmaps, such as sample annotation and hierarchical clustering.

You may first need to install ComplexHeatmap and another supporting package, cowplot, for this portion of the workshop. To do so, run the code below:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("ComplexHeatmap", "cowplot"))

Heatmap using ggplot2

First, we will create a basic heatmap using ggplot2. Heatmaps in ggplot2 are created using the geom_tile() function, which allows us to represent each cell of the matrix as a coloured rectangle.

Create Data for Heatmap

We will use the same genes of interest that we explored with our box plots and violin plots. For this heatmap, we’ll extract the expression data for these genes, and then as before we’ll set the Group column as a factor to indicate the ordering of the TB groups. We’ll also include a new additional column for the sample names called ‘Sample_ID’. When we use pivot_longer() both ‘Sample_ID’ and ‘Group’ will be used as ID variables.

Another important step we take is to z-score scale the data for each gene.

Note

A z-score tells us the number of standard deviations a value is from the mean of a given distribution. Negative z-scores indicate the value lies below the mean. Positive z-scores indicate the value lies above the mean.

We want our data to be z-score scaled for each row of the heatmap, i.e. each gene. When we z-score scale the data for each gene we make it so that the mean of all of all the samples is 0 and the standard deviation is 1. This might sound confusing/unintuitive until we see it in action with our heatmaps below.

# Genes of interest
genes_of_interest <- c("FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", "RP5-1184F4.7")

# Extract expression data for the selected genes
heatmap_data <- data.frame(t(sample_data[genes_of_interest, ]))

# Add group information
heatmap_data$Group <- metadata$Group

# Add sample names to a new column
heatmap_data$Sample_ID <- rownames(heatmap_data)

# Reshape data into long format for ggplot2 
heatmap_data_long <- heatmap_data %>%
  pivot_longer(-c(Sample_ID, Group), names_to = "Gene", values_to = "Expression")

# Z-score scale the expression values for each gene across the samples
heatmap_data_long <- heatmap_data_long %>%
  group_by(Gene) %>%
  mutate(Expression = scale(Expression))

# Convert group data to ordered factor
heatmap_data_long$Group <- factor(heatmap_data_long$Group,
                            ordered = TRUE,
                            levels = c("Control", "Latent TB", "TB progressor", "Active TB"))


# Inspect data structure
head(heatmap_data_long)
# A tibble: 6 × 4
# Groups:   Gene [6]
  Group     Sample_ID Gene         Expression[,1]
  <ord>     <chr>     <chr>                 <dbl>
1 Active TB Active.TB FCGR1A                 1.75
2 Active TB Active.TB SERPING1               1.80
3 Active TB Active.TB UBE2L6                 1.83
4 Active TB Active.TB GBP4                   1.54
5 Active TB Active.TB MAGEF1                -1.23
6 Active TB Active.TB RP5.1184F4.7          -2.71
Create a Heatmap with ggplot2

Next, we’ll create the heatmap using ggplot2. We’ll use the geom_tile() geometry to create the heatmap (the heatmap tiles). We’ll set the fill colour to represent the expression values and customise the plot further using themes and colour scales.

Unlike our previous worked examples, where we used discrete colour scales, here will will be using a continuous gradient colour scheme.

# Plot the heatmap
p <- ggplot(heatmap_data_long, aes(x = Sample_ID, y = Gene, fill = Expression)) +
  geom_tile() +  # Create heatmap cells
  scale_fill_gradientn(colours = c("blue", "white", "red")) +  # Set colour gradient for expression
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Sample ID") +
  ylab("Gene") +
  ggtitle("Heatmap of Gene Expression")

p

Here we have quite a basic heatmap. The rows are the genes, the columns are the individual samples, so each individual tile represents each combination of a gene and a sample. The tiles are coloured by the Z-score scaled expression values. This allows us to see patterns of expression for each gene, relative to eachother.

The columns are ordered so that the Active TB samples are first, then the Control samples, then the Latent TB samples and finally the TB progressor samples. This allows us to see some vague patterns in the data. We can see that FCGR1A, SERPING1, UBE2L6 and GBP4 levels tend to be higher in the Active TB and TB progressor groups vs the Control and Latent TB groups. Inversely, MAGEF1 and RP5-1184F4.7 appear to have lower expression levels in the Active TB group vs every other group.

Exercise 1
  1. Can you create a heatmap with ggplot2 for the genes with the data z-score scaled by column (Sample_ID) rather than row? What differences do you see?
Answer
# Genes of interest
genes_of_interest <- c("FCGR1A", "SERPING1", "UBE2L6", "GBP4", "MAGEF1", "RP5-1184F4.7")

# Extract expression data for the selected genes
heatmap_data <- data.frame(t(sample_data[genes_of_interest, ]))

# Add group information
heatmap_data$Group <- metadata$Group

# Add sample names to a new column
heatmap_data$Sample_ID <- rownames(heatmap_data)

# Reshape data into long format for ggplot2 
heatmap_data_long <- heatmap_data %>%
  pivot_longer(-c(Sample_ID, Group), names_to = "Gene", values_to = "Expression")

# Z-score scale the expression values for each sample across the genes
heatmap_data_long_col_scaled <- heatmap_data_long %>%
  group_by(Sample_ID) %>%
  mutate(Expression = scale(Expression))

# Convert group data to ordered factor
heatmap_data_long_col_scaled$Group <- factor(heatmap_data_long_col_scaled$Group,
                            ordered = TRUE,
                            levels = c("Control", "Latent TB", "TB progressor", "Active TB"))


# Plot the heatmap
p_col_scaled <- ggplot(heatmap_data_long_col_scaled, aes(x = Sample_ID, y = Gene, fill = Expression)) +
  geom_tile() +  # Create heatmap cells
  scale_fill_gradientn(colours = c("blue", "white", "red")) +  # Set colour gradient for expression
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Sample ID") +
  ylab("Gene") +
  ggtitle("Heatmap of gene expression - scaled for each sample")

p_col_scaled

Now we can see the heatmap is scaled for each column. Now we see patterns across the genes, instead of inside each gene. What this is showing is is that RP5-1184F4.7 has the lowest expression overall relative to the others and UBE2L6 has the highest expression overall relative to the others. Changes in expression within these genes between the groups only become apparent when we scale by row instead.

Improve it

However, this heatmap has a few short-comings. As there are so many samples it is difficult to read the labels on the x-axis. ggplot2 doesn’t include native support for cluster dendograms so it is difficult to get an idea of the relationship between samples.

Finally, we lack any sort of ‘Group’ information/annotation. We can manually create a ‘Group’ visualisation that we can combine with our heatmap. The way this works is that we create a seperate heatmap for our ‘Group’ information. We set our x-axis as our Sample_ID and our y-axis as 1 to create a grid 1 tile high. Then we fill the tiles with our grouping information.

# Load supporting library for creating plot grid
library(cowplot)

# Generate annotation plot for group information
annotation_plot <- ggplot(heatmap_data_long, aes(x = Sample_ID, y = 1, fill = Group)) +
  geom_tile() +
  scale_fill_manual(values = c("Control" = "#308695",
                               "Latent TB" = "#a077a6",
                               "TB progressor" = "#ab547c",
                               "Active TB" = "#ba2b40")) +
  theme_void() +  # Remove axis and grid lines
  theme(legend.position = "top") +  # Place legend on top
  ylab("Group") +
  guides(fill = guide_legend(title = "Group")) +  
  ggtitle("Heatmap of Gene Expression\n") # Reuse title here so it appears on top of final plot


# Remove x-axis labels from the main heatmap now that we have group labelling
p <- p + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  ggtitle("")

# Combine annotation plot and heatmap using cowplot::plot_grid
combined_plot <- plot_grid(annotation_plot, p, ncol = 1, align = "v", rel_heights = c(0.3, 1))

# Display the combined plot
combined_plot

Though this is better, it is quite a lot of code/effort to produce an overall sub par heatmap!

Heatmap using ComplexHeatmap

Now, let’s explore the ComplexHeatmap package, which simplifies the creation of heatmaps with enhanced features.

Load relevant packages
# Load necessary libraries
library(ComplexHeatmap)
library(circlize)  # For colour scaling
Prepare data for ComplexHeatmap

Next, we’ll prepare the data for the ComplexHeatmap. One difference between ggplot2 and ComplexHeatmap is that ggplot2 expects the data to be in a tidy/long format, whereas ComplexHeatmap expects the data to be a matrix, with the gene names as the row names and the sample names as the column names.

As with our ggplot2 example, we will be z-score scaling our row data (genes) so that the mean of all of the values is 0 and the standard deviation is 1.

# Extract and scale the expression data for genes of interest
heatmap_data <- sample_data[genes_of_interest, ]
heatmap_data_scaled <- t(apply(heatmap_data, 1, scale))  # Z-score scale per gene
colnames(heatmap_data_scaled) <- colnames(heatmap_data)

# Prepare group information for annotation
group_annotation <- metadata$Group
Create a Basic Heatmap with ComplexHeatmap

We can now create a heatmap using the Heatmap() function from ComplexHeatmap. This function supports hierarchical clustering and allows us to add annotations and other customisation.

# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

# Create heatmap annotations for groups
ha <- HeatmapAnnotation(Group = group_annotation, 
                        col = list(Group = c("Control" = "#308695", "Latent TB" = "#a077a6",
                                             "TB progressor" = "#ab547c", "Active TB" = "#ba2b40")))

# Generate the heatmap
ht <- ComplexHeatmap::Heatmap(heatmap_data_scaled, 
              name = "Expression", 
              col = col_fun, 
              top_annotation = ha,
              show_row_names = TRUE,
              show_column_names = FALSE,
              cluster_rows = TRUE,
              cluster_columns = TRUE)

To explain briefly what the code above is doing:

  • First we create a colour gradient function for our heatmap using colorRamp2 and assign it to col_fun

  • Secondly, we create an annotation object (ha) for our heatmap, indicating the different groups (e.g., Control, Latent TB) of the samples. Each group is assigned a specific colour.

  • The final part generates the heatmap (ht) itself using the Heatmap function.

    • heatmap_data_scaled: The scaled expression data to be visualised.

    • name: Sets the title for the colour scale (in this case, “Expression”).

    • col: Applies the colour scale function created earlier (col_fun).

    • top_annotation: Adds the group annotations (ha) at the top of the heatmap.

    • show_row_names: Displays the names of the genes on the y-axis.

    • show_column_names: Hides the sample names on the x-axis.

    • cluster_rows and cluster_columns: Carry out hierarchical clustering on the row and column data.

Now, let’s visualise our heatmap.

# Draw the heatmap
draw(ht)

This heatmap contains a lot more information compared to the one we produced with ggplot2!

From the row dendograms, we can see that MAGEF1 and RP5-1184F4.7 are both downregulated in Active TB and cluster together. The genes that are upregulated in Active TB also cluster together, with SERPING1 and UBE2L6 being most similar to eachother.

From the column dendograms we can see that there are three main clusters. The majority of the Active TB samples cluster together, with one group forming a large cluster on the left. The next cluster of samples is more mixed, though contains the majority of our TB progressor samples. This fits with the information we’d previously gained from our violin plots and box plots, showing the broad distribution of expression within the TB progressor samples. The final cluster on the right contains the majority of the control and latent TB samples - it is unsurprising they cluster together, as our violin plots showed the similar patterns of expressions for the selected genes within these two groups. By carrying out hierarchical clustering we are able to display these more sophisticated relationships between the samples.

Extra Exercises
Exercise 2

In this worked example we selected a handful of genes of interest based on our earlier volcano plot for the Active TB vs Control comparison.

  1. Can you create a heatmap with new genes of interest? These could be ones you’ve picked yourself or ones from the TB Progressor vs Control comparison.
  2. Can you create a heatmap with all significant genes, either from the Active TB vs Control or TB progressor vs control comparison.
Answer 1
# Top significant genes from the TB progressor vs control comparison
genes_of_interest <- c("UBE2L6", "SERPING1", "KRT73-AS1", "ADAMTSL4", "BATF2", "CNR2")

# Extract and scale the expression data for genes of interest
heatmap_data <- sample_data[genes_of_interest, ]
heatmap_data_scaled <- t(apply(heatmap_data, 1, scale))  # Z-score scale per gene
colnames(heatmap_data_scaled) <- colnames(heatmap_data)

# Prepare group information for annotation
group_annotation <- metadata$Group

# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

# Create heatmap annotations for groups
ha <- HeatmapAnnotation(Group = group_annotation, 
                        col = list(Group = c("Control" = "#308695", "Latent TB" = "#a077a6",
                                             "TB progressor" = "#ab547c", "Active TB" = "#ba2b40")))

# Generate the heatmap
ht <- ComplexHeatmap::Heatmap(heatmap_data_scaled, 
              name = "Expression", 
              col = col_fun, 
              top_annotation = ha,
              show_row_names = TRUE,
              show_column_names = FALSE,
              cluster_rows = TRUE,
              cluster_columns = TRUE)

# Draw the heatmap
draw(ht)

Answer 2

Below, we use all the differentially expressed genes from the Active TB vs Control comparison. When we do so we see four large clusters on the resulting heatmap, these are broadly divided on the x-axis by Active TB vs the rest of the groups, then on the y-axis by genes that are upregulated in Active TB compared to the other conditions, vs those that are downregulated in Active TB compared to the other conditions.

# Top significant genes from the TB progressor vs control comparison
genes_of_interest <- activeTB_vs_control %>%
  filter(adj.P.Val < 0.05  & abs(logFC) > 1) %>%
  row.names()

# Extract and scale the expression data for genes of interest
heatmap_data <- sample_data[genes_of_interest, ]
heatmap_data_scaled <- t(apply(heatmap_data, 1, scale))  # Z-score scale per gene
colnames(heatmap_data_scaled) <- colnames(heatmap_data)

# Prepare group information for annotation
group_annotation <- metadata$Group

# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

# Create heatmap annotations for groups
ha <- HeatmapAnnotation(Group = group_annotation, 
                        col = list(Group = c("Control" = "#308695", "Latent TB" = "#a077a6",
                                             "TB progressor" = "#ab547c", "Active TB" = "#ba2b40")))

# Generate the heatmap
ht_all_sig <- ComplexHeatmap::Heatmap(heatmap_data_scaled, 
              name = "Expression", 
              col = col_fun, 
              top_annotation = ha,
              show_row_names = FALSE, # row names are hidden due to number of genes
              show_column_names = FALSE,
              cluster_rows = TRUE,
              cluster_columns = TRUE)

# Draw the heatmap
draw(ht_all_sig)

Exercise 3
  1. Can you create a heatmap for the genes of interest without z-score scaling the data. What patterns do you see?
Answer
# Top significant genes from the TB progressor vs control comparison
genes_of_interest <- activeTB_vs_control %>%
  filter(adj.P.Val < 0.05  & abs(logFC) > 1) %>%
  row.names()

# Extract and scale the expression data for genes of interest
heatmap_data <- sample_data[genes_of_interest, ]


# Prepare group information for annotation
group_annotation <- metadata$Group

# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

# Create heatmap annotations for groups
ha <- HeatmapAnnotation(Group = group_annotation, 
                        col = list(Group = c("Control" = "#308695", "Latent TB" = "#a077a6",
                                             "TB progressor" = "#ab547c", "Active TB" = "#ba2b40")))

# Generate the heatmap
ht_all_sig_not_scaled <- ComplexHeatmap::Heatmap(heatmap_data, 
              name = "Expression", 
              col = col_fun, 
              top_annotation = ha,
              show_row_names = FALSE, # row names are hidden due to number of genes
              show_column_names = FALSE,
              cluster_rows = TRUE,
              cluster_columns = TRUE)
Warning: The input is a data frame-like object, convert it to a matrix.
# Draw the heatmap
draw(ht_all_sig_not_scaled)

Adding extra detail to our heatmap

Finally, one of the thing that really sets ComplexHeatmap apart is the level of annotation we can add to our heatmaps. We will demonstrate this below by adding ‘age’ and ‘gender’ annotation from our metadata table.

# Create a colour scale for expression values
col_fun <- colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))

# Create a color scale for continuous 'Age'
age_col_fun <- colorRamp2(c(min(metadata$Age), max(metadata$Age)), c("lightyellow", "darkorange"))

# Create a color scale for categorical 'Group', 'Ethnicity', and 'Gender'
col_fun_group <- list(Group = c("Control" = "#308695", 
                                "Latent TB" = "#a077a6",
                                "TB progressor" = "#ab547c", 
                                "Active TB" = "#ba2b40"))


col_fun_gender <- list(Gender = c("Male" = "#1F78B4", "Female" = "#FB9A99"))

# Create heatmap annotations for groups, ethnicity, gender, and age
ha <- HeatmapAnnotation(Group = metadata$Group, 
                        Gender = metadata$Gender,
                        Age = metadata$Age, 
                        col = c(col_fun_group, col_fun_gender, 
                                Age = age_col_fun))

# Generate the heatmap with added annotations
ht_extra_anno <- ComplexHeatmap::Heatmap(heatmap_data_scaled, 
              name = "Expression", 
              col = col_fun, 
              top_annotation = ha,
              show_row_names = FALSE, # row names are hidden due to number of genes
              show_column_names = FALSE,
              cluster_rows = TRUE,
              cluster_columns = TRUE)

# Draw the heatmap
draw(ht_extra_anno)

We can see that the samples still cluster mostly by ‘Group’, but there appears to be some mild clustering by ‘Gender’. ‘Age’ doesn’t appear to have much effect on the clustering.

Bonus exercises
  1. Can you modify the above code so that ethnicity is also included in the heatmap annotation.

  2. Can you modify the above code so a different continuous colour scale is used for the tiles of the heatmap?

Here the answers won’t be provided and I’ll leave you to figure this out on your own!

Advantages of ComplexHeatmap

ComplexHeatmap offers several advantages over ggplot2 for heatmap generation:

  • Hierarchical Clustering: By default, ComplexHeatmap performs clustering of both rows (genes) and columns (samples). This is crucial for identifying patterns in the data.

  • Annotations: It is easy to add sample annotations, such as group, gender, ethnicity, etc.

  • Dendrograms: The package includes dendrograms to show relationships between samples and genes based on their expression patterns.

Comparison: ggplot2 vs. ComplexHeatmap

While both ggplot2 and ComplexHeatmap are powerful, each has its strengths for specific tasks:

Feature ggplot2 ComplexHeatmap
Ease of use Extremely flexible but can require a lot of manual setup for certain types of plot Mostly specialised for heatmaps, easy to produce complex heatmaps with multiple levels of annotation
Clustering Requires manual calculation and ordering Automatic hierarchical clustering
Sample annotations Limited, manual setup needed, can require the usage of additional supporting packages such as patchwork or cowplot Simple and intuitive with HeatmapAnnotation()
Dendrograms Not supported Supported and automatic
Customisability Extremely flexible for all plot types Specialised for heatmaps

In summary, ggplot2 is a general-purpose plotting package and can handle a wide variety of visualisations. In contrast, ComplexHeatmap excels in producing heatmaps with features commonly needed in bioinformatics analysis, such as clustering and sample annotations (especially in the case where many groupings are present).

Overall, this practical has aimed to showcase ggplot2 and its capabilities, but also it’s occasional limitations. Sometimes it is better to use a tool that is already specialised for your needs, rather than attempting to code something from scratch. Understanding the strengths of specialised tools is essential in making informed choices for visualising data effectively in your analyses.

Further reading/resources


References

Krzywinski, Martin, and Naomi Altman. 2014. “Visualizing Samples with Box Plots.” Nature Methods 11 (2): 119–20. https://doi.org/10.1038/nmeth.2813.
Singhania, Akul, Raman Verma, Christine M. Graham, Jo Lee, Trang Tran, Matthew Richardson, Patrick Lecine, et al. 2018. “A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection.” Nature Communications 9 (1): 2308. https://doi.org/10.1038/s41467-018-04579-w.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer. https://doi.org/10.1007/978-0-387-98141-3.
———. 2016. “Data Analysis.” In, edited by Hadley Wickham, 189–201. Use R! Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4_9.